Add peer access control for DeviceMemoryResource #1289
Conversation
Force-pushed from bd48cf9 to 634f931
/ok to test 84378f4
```cython
from .._device import Device

# Convert all devices to device IDs
cdef set target_ids = {Device(dev).device_id for dev in devices}
```
I think I should filter this set by cudaDeviceCanAccessPeer
Done in the latest. I added a check that raises an error if a non-accessible device is added to the list. Unfortunately, I don't see a good way to test it.
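For reference, a minimal sketch of such a check, using the low-level `cuda.bindings.runtime` API; the helper name is illustrative, not the PR's actual code:

```python
from cuda.bindings import runtime as cudart

def validate_peer_devices(owner_id, target_ids):
    # Reject any device that cannot directly access memory on the owner device.
    for dev_id in target_ids:
        err, can_access = cudart.cudaDeviceCanAccessPeer(dev_id, owner_id)
        if err != cudart.cudaError_t.cudaSuccess or not can_access:
            raise ValueError(
                f"device {dev_id} cannot peer-access memory on device {owner_id}"
            )
```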
```cython
bint _mempool_owned
IPCData _ipc_data
object _attributes
object _peer_accessible_by
```
I will address peer access with IPC memory pools in a follow-up change. The peer access attributes are not inherited when an allocation is sent to another process via IPC, but access can be set. It will require a new test and possibly a small code change.
|
/ok to test b155df7
```python
from cuda.core.experimental._utils.cuda_utils import CUDAError
from helpers.buffers import PatternGen, compare_buffer_to_constant, make_scratch_buffer

NBYTES = 1024
```
I'd make this a local variable in the unit test instead of a global. Bonus points if you make the pytest test multiple memory block sizes.
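A sketch of the suggested parametrization, with illustrative sizes (the test body is elided):

```python
import pytest

@pytest.mark.parametrize("nbytes", [64, 1024, 64 * 1024])
def test_peer_access(nbytes):
    # nbytes is local to each test case instead of a module-level global.
    ...
```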
This merge brings in the latest changes from main (commit a4b285a), including the peer access control for DeviceMemoryResource (NVIDIA#1289). The merge combines:

- Latest main changes (a4b285a)
- Experimental namespace deprecation work from this branch

All files remain moved from cuda.core.experimental.* to cuda.core.* with backward compatibility stubs maintained.
Description

Closes #479

This PR implements peer access control for `DeviceMemoryResource`, allowing memory pool allocations to be accessed from devices other than the owner device.

Overview

Memory pools created by `DeviceMemoryResource` can now grant peer access permissions to other GPUs, enabling multi-GPU workflows where allocations on one device need to be accessed by kernels running on another device.

Key Changes

1. New `peer_accessible_by` property

Example usage:
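A sketch of how the property might be used, assuming the setter accepts an iterable of `Device` objects; the exact accepted types and defaults follow the PR, not this sketch:

```python
from cuda.core.experimental import Device, DeviceMemoryResource

dev0 = Device(0)
dev0.set_current()

mr = DeviceMemoryResource(dev0.device_id)

# Grant device 1 read/write access to all allocations from this pool.
mr.peer_accessible_by = [Device(1)]

buf = mr.allocate(1024)  # accessible from kernels running on device 1

# Assigning a sequence that omits a device revokes its access.
mr.peer_accessible_by = []
```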
2. Implementation details

- Uses `cuMemPoolSetAccess` to modify peer access permissions at runtime
- Passes `CUmemAccessDesc` structures to the driver API
- Supports granting (`CU_MEM_ACCESS_FLAGS_PROT_READWRITE`) and revoking (`CU_MEM_ACCESS_FLAGS_PROT_NONE`) access in a single transaction, as sketched below

3. Debugging support

- Adds a `DMR_mempool_get_access()` helper function to probe actual access permissions via `cuMemPoolGetAccess`
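For illustration, a sketch of how those driver calls might be combined, using the `cuda.bindings.driver` API (the actual implementation is Cython and allocates the descriptor array with `malloc`; the helper name here is hypothetical and error handling is simplified):

```python
from cuda.bindings import driver

def set_pool_peer_access(pool, device_ids, grant):
    # One access descriptor per device; a single cuMemPoolSetAccess call
    # applies every grant/revocation in one transaction.
    descs = []
    for dev_id in device_ids:
        desc = driver.CUmemAccessDesc()
        desc.location.type = driver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
        desc.location.id = dev_id
        desc.flags = (
            driver.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE
            if grant
            else driver.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_NONE
        )
        descs.append(desc)
    (err,) = driver.cuMemPoolSetAccess(pool, descs, len(descs))
    if err != driver.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"cuMemPoolSetAccess failed: {err}")
```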
DMR_mempool_get_access()helper function to probe actual access permissions viacuMemPoolGetAccess4. Workaround for driver bug (nvbug 5698116)
__dealloc__to reset peer access before destroying memory poolsTechnical Notes
malloc/freeimports fromlibc.stdlibfor C array allocationTesting
This feature is designed to work with the existing multi-GPU test suite and enables peer access scenarios for buffer copy operations between devices.
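A hedged sketch of the kind of scenario this enables, assuming two peer-capable devices and the property semantics shown earlier:

```python
from cuda.core.experimental import Device, DeviceMemoryResource

dev0, dev1 = Device(0), Device(1)
dev0.set_current()

mr = DeviceMemoryResource(dev0.device_id)
mr.peer_accessible_by = [dev1]
src = mr.allocate(1024)

# Work issued on dev1 can now touch `src`, even though its pool lives on dev0.
dev1.set_current()
stream = dev1.create_stream()
dst = DeviceMemoryResource(dev1.device_id).allocate(1024, stream=stream)
dst.copy_from(src, stream=stream)
stream.sync()
```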